kubernetes service


ChaosEater: Fully Automating Chaos Engineering with Large Language Models

Kikuta, Daisuke, Ikeuchi, Hiroki, Tajiri, Kengo, Nakano, Yuusuke

arXiv.org Artificial Intelligence

Chaos Engineering (CE) is an engineering technique aimed at improving the resiliency of distributed systems. It involves artificially injecting specific failures into a distributed system and observing its behavior in response. Based on the observation, the system can be proactively improved to handle those failures. Recent CE tools realize the automated execution of predefined CE experiments. However, defining these experiments and reconfiguring the system after the experiments still remain manual. To reduce the costs of the manual operations, we propose ChaosEater, a system for automating the entire CE operations with Large Language Models (LLMs). It pre-defines the general flow according to the systematic CE cycle and assigns subdivided operations within the flow to LLMs. We assume systems based on Infrastructure as Code (IaC), wherein the system configurations and artificial failures are managed through code. Hence, the LLMs' operations in our system correspond to software engineering tasks, including requirement definition, code generation and debugging, and testing. We validate our system through case studies on both small and large systems. The results demonstrate that our system significantly reduces both time and monetary costs while completing reasonable single CE cycles.
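Because the paper assumes IaC-managed systems, the artificial failures themselves are declared as code. As a hedged illustration (not taken from the paper), a pod-kill fault expressed with the open-source Chaos Mesh tool might look like the sketch below; the names, namespaces, and labels are hypothetical:

```yaml
# Hypothetical Chaos Mesh experiment: kill one pod whose labels match
# the selector. All names and namespaces here are illustrative.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-frontend-pod
  namespace: chaos-testing
spec:
  action: pod-kill        # the failure to inject
  mode: one               # target a single matching pod
  selector:
    namespaces:
      - demo-app          # assumed application namespace
    labelSelectors:
      app: frontend       # assumed pod label
```

In a workflow like the one ChaosEater automates, an LLM would generate such a manifest from the experiment definition, apply it, and then evaluate the system's behavior under the injected failure.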


Managing GPU Costs for Production AI

#artificialintelligence

As teams integrate ML/AI models into production systems at scale, they're increasingly encountering a new obstacle: high GPU costs from running models in production. While GPUs are used in both model training and production inference, it's tough to yield savings or efficiencies during the training process. Training is costly because it's time-intensive, but fortunately, it's likely not happening every day. This blog focuses on optimizations you can make to generate cost savings while using GPUs for running inference in production. The first part provides some general recommendations for using GPUs more efficiently, while the second walks through steps you can take to optimize GPU usage with commonly used architectures.


Advantages of Deploying Machine Learning models with Kubernetes

#artificialintelligence

A data scientist works hard to build a machine learning model that helps solve a business problem. However, when it comes to deploying the model, challenges arise: how to scale the model, how the model can interact with different services within or outside the application, how to automate repetitive operations, and so on. Kubernetes is a strong fit for these problems. In this blog, I will help you understand the basics of Kubernetes, its benefits for the deployment of Machine Learning (ML) models, and how to actually do the deployment using the Azure Kubernetes Service.
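To make the deployment step concrete, here is a minimal sketch of the kind of Kubernetes manifests involved in serving a model; the image name, port, and labels are assumptions for illustration, not taken from the article:

```yaml
# Hypothetical manifests: a Deployment running a model-serving container,
# exposed behind a Service. Scaling is handled by the replica count.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-model
spec:
  replicas: 3                 # scale out by raising this number
  selector:
    matchLabels:
      app: ml-model
  template:
    metadata:
      labels:
        app: ml-model
    spec:
      containers:
        - name: model-server
          image: myregistry.azurecr.io/ml-model:v1   # assumed image name
          ports:
            - containerPort: 8080                    # assumed serving port
---
apiVersion: v1
kind: Service
metadata:
  name: ml-model
spec:
  selector:
    app: ml-model
  ports:
    - port: 80
      targetPort: 8080
```

Manifests like these are applied with `kubectl apply -f`, and the same files work unchanged on a managed cluster such as Azure Kubernetes Service once the cluster is provisioned.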


D2iQ Streamlines Smart Cloud-Native Application Deployments with Kaptain AI/ML 2.0

#artificialintelligence

D2iQ, the leading enterprise Kubernetes provider for smart cloud-native applications, announced version 2.0 of Kaptain AI/ML, the enterprise-ready distribution of open-source Kubeflow that enables organizations to develop, deploy, and run artificial intelligence (AI) and machine learning (ML) workloads in production environments. Powered by Kubeflow 1.5, the Kubernetes machine learning toolkit, Kaptain AI/ML now provides data science teams with features such as expanded control for mounting data volumes and increased visibility into idle notebooks, so they can spend more time developing and less time managing infrastructure. The enhanced user experience enables data scientists to more effectively manage the lifecycle of AI and ML models without the need for infrastructure knowledge and skill sets. By simplifying the deployment and full lifecycle management of AI and ML workloads at scale, Kaptain AI/ML 2.0 accelerates the impact of smart cloud-native applications. This enables organizations to drive better business results by more quickly delivering new smart products and services, becoming more agile when updating models, and driving smarter customer experiences.


A Decade of Microsoft Azure: From its Creation to its Many Successful Transformations

#artificialintelligence

Windows Azure--no, that was not a typo... "Windows" and not "Microsoft" just yet--had its debut in 2008 at the Professional Developers Conference (PDC). Ray Ozzie, Microsoft's former chief software architect, revealed Windows Azure as "a service, hosted and maintained by Microsoft on an array of distributed data centers." Then two years later--on February 1, 2010--it became generally available. The operating system designed to become Microsoft's cloud launched as a Platform as a Service (PaaS) offering, and a community of developers was the first to build a specific class of web applications on it.


Machine learning in Palo Alto firewalls adds new protection for IoT, containers

#artificialintelligence

Palo Alto Networks has released next-generation firewall (NGFW) software that integrates machine learning to help protect enterprise traffic to and from hybrid clouds, IoT devices and the growing numbers of remote workers. The machine learning is built into the latest version of Palo Alto's firewall operating system -- PAN 10.0 -- to prevent real-time signatureless attacks and to quickly identify new devices -- in particular IoT products -- with behavior-based identification. NGFWs include traditional firewall protections like stateful packet inspection but add advanced security judgments based on application, user and content. "Security attacks are continually morphing at a rapid pace, and traditional signature-based security approaches cannot keep up with the millions of new devices, running a variety of operating systems and software stacks, coming on the network," said Anand Oswal, senior vice president and GM at Palo Alto. "IoT devices, which are growing exponentially, exacerbate that issue because they have so many of their own different agents, patches and OSes that it's impossible to set security policies around them." Oswal said the ML in its new NGFW uses inline machine-learning models to identify variants of known attacks as well as many unknown cyberthreats, preventing up to 95% of zero-day malware in real time.


AIOps: Is DevOps Ready for an Infusion of Artificial Intelligence? - The New Stack

#artificialintelligence

This article is part of a series on bringing continuous integration and deployment (CI/CD) practices to machine learning. Check back to The New Stack for future installments. With orchestration and monitoring playing such key roles in DevOps, the emerging trend of using artificial intelligence (AI) to support and even automate operations roles by delivering real-time insights about what's happening in your infrastructure seems an obvious fit. DevOps is about improving agility and flexibility; AIOps should be able to help by automating the path from development to production, predicting the effect of deployment on production, and automatically responding to changes in how the production environment is performing. That's especially true as trends like microservices, hybrid cloud, edge computing and IoT increase the complexity of app infrastructures -- along with the number of logs you might have to sift through to find the root cause of an issue, and the number of people who need to be on a conference call or in a chat room tracking down what's gone wrong and how to fix it.